How many rows and columns?

This tells us we have 10,841 rows and 12 columns.
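As a minimal sketch, the row and column counts can be read from the DataFrame's .shape attribute. The name df_apps and the toy values below are assumptions for illustration; the real data would be loaded from a file instead.

```python
import pandas as pd

# Toy stand-in for the real dataset (made-up values, hypothetical name df_apps).
df_apps = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.5, None, 3.9],
    "Price": ["0", "$2.99", "0"],
})

# .shape is a (rows, columns) tuple.
n_rows, n_cols = df_apps.shape
print(f"{n_rows} rows and {n_cols} columns")
```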

We can already see that there are some data issues that we need to fix. In the Rating and Type columns there are NaN (not-a-number) values, and in the Price column we have dollar signs that will cause problems.

Finding random rows to explore our data.

Sample() Method to Explore our data:

The .sample(n) method will give us n random rows. This is another handy way to inspect our DataFrame.
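A quick sketch of .sample() on a made-up DataFrame (df_apps and its values are assumptions for illustration). The rows returned differ on every run unless you pass random_state.

```python
import pandas as pd

# Toy data: ten made-up apps.
df_apps = pd.DataFrame({
    "App": [f"App {i}" for i in range(10)],
    "Installs": ["1,000"] * 10,
})

# Five rows picked at random, without replacement.
random_rows = df_apps.sample(5)
print(random_rows)
```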

Data Cleaning: Removing NaN Values and Duplicates

Remove the columns called Last_Updated and Android_Ver from the DataFrame. We will not use these columns.

To remove the unwanted columns, we simply provide a list of the column names ['Last_Updated', 'Android_Ver'] to the .drop() method. By setting axis=1 we are specifying that we want to drop columns rather than rows.
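Here is a small sketch of the column drop. The DataFrame name df_apps and the row values are made up for illustration; only the two column names come from the text above.

```python
import pandas as pd

# Toy data with the two columns we want to get rid of.
df_apps = pd.DataFrame({
    "App": ["App A", "App B"],
    "Last_Updated": ["2018-01-01", "2018-02-01"],
    "Android_Ver": ["4.1 and up", "5.0 and up"],
})

# axis=1 means "drop columns"; axis=0 would look for row labels instead.
df_apps = df_apps.drop(["Last_Updated", "Android_Ver"], axis=1)
print(df_apps.columns.tolist())
```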

Find NaN values:

How many rows have a NaN (not-a-number) value in the Rating column? Create a DataFrame called df_apps_clean that does not include these rows.

To find and remove the rows with the NaN values we can create a subset of the DataFrame based on where .isna() evaluates to True. We see that NaN values in ratings are associated with no reviews (and no installs). That makes sense.

We can drop the NaN values with .dropna():
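A minimal sketch of both steps on toy data (the values and the name df_apps are assumptions). Note that .dropna() with no arguments removes a row if any of its columns is NaN, not just Rating.

```python
import numpy as np
import pandas as pd

# Toy data: one app has no rating.
df_apps = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Rating": [4.5, np.nan, 3.9],
    "Reviews": [120, 0, 45],
})

# Subset of rows where Rating is missing.
nan_rows = df_apps[df_apps["Rating"].isna()]
print(len(nan_rows))

# Drop every row that contains at least one NaN.
df_apps_clean = df_apps.dropna()
print(df_apps_clean.shape)
```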

This leaves us with 9,367 entries in our DataFrame. But there may be other problems with the data too.

Are there any duplicates in the data?

Check for duplicates using the .duplicated() function.

How many entries can you find for the "Instagram" app? Use .drop_duplicates() to remove any duplicates from df_apps_clean.

We can actually check for an individual app like ‘Instagram’ by looking up all the entries with that name in the App column.

So how do we get rid of duplicates? Can we simply call .drop_duplicates()?
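One way to think about it, sketched on made-up rows: .drop_duplicates() with no arguments only removes rows that are identical in every column, so entries that differ slightly (say, in the number of reviews) survive. Passing subset tells pandas which columns define a duplicate. The choice of ['App', 'Type', 'Price'] here is an assumption for illustration.

```python
import pandas as pd

# Toy data: three Instagram rows, one of which has a different review count.
df_apps_clean = pd.DataFrame({
    "App": ["Instagram", "Instagram", "Instagram", "Other App"],
    "Reviews": [100, 100, 105, 7],
    "Type": ["Free", "Free", "Free", "Paid"],
    "Price": ["0", "0", "0", "$1.99"],
})

# Rows that exactly repeat an earlier row.
fully_identical = df_apps_clean[df_apps_clean.duplicated()]

# All entries for a single app.
instagram_rows = df_apps_clean[df_apps_clean["App"] == "Instagram"]

# Without subset, the row with 105 reviews is kept as "different".
plain = df_apps_clean.drop_duplicates()

# With subset, only the listed columns are compared.
by_subset = df_apps_clean.drop_duplicates(subset=["App", "Type", "Price"])
print(len(plain), len(by_subset))
```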

What else should I know about the data?

So we can see that 13 different features were originally scraped from the Google Play Store.

Obviously, the data is just a sample of all the Android apps, of which there are millions.

I’ll assume that the sample is representative of the Google Play Store as a whole. This is not necessarily the case as, during the web scraping process, this sample was served up based on the geographical location and user behaviour of the person who scraped it - in our case Lavanya Gupta.

The data was compiled around 2017/2018. The pricing data reflects the price in US dollars at the time of scraping (developers can offer promotions and change their app’s pricing).

I’ve converted the app’s size to a floating-point number in MBs. If data was missing, it has been replaced by the average size for that category.

The installs are not the exact number of installs. If an app has 245,239 installs then Google will simply report an order of magnitude like 100,000+. I’ve removed the '+' and we’ll assume the exact number of installs in that column for simplicity.

Preliminary Exploration: The Highest Ratings, Most Reviews, and Largest Size

The Highest Rating:

Identify which apps are the highest rated. What problem might you encounter if you rely on ratings alone to determine the quality of an app?

Only apps with very few reviews (and a low number of installs) have perfect 5 star ratings (most likely by friends and family).
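A sketch of how this can be checked: sort by Rating in descending order and look at the Reviews column next to it. The rows below are made up purely to show the pattern.

```python
import pandas as pd

# Toy data: the perfect rating belongs to the app with almost no reviews.
df_apps_clean = pd.DataFrame({
    "App": ["Tiny App", "Popular App", "Mid App"],
    "Rating": [5.0, 4.5, 4.7],
    "Reviews": [3, 1_000_000, 2_500],
})

# Highest rated apps first.
top_rated = df_apps_clean.sort_values("Rating", ascending=False).head(5)
print(top_rated)
```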

The largest Size:

What's the size in megabytes (MB) of the largest Android apps in the Google Play Store? Based on the data, do you think there could be a limit in place or can developers make apps as large as they please?

Here we can clearly see that there seems to be an upper bound of 100 MB for the size of an app. A quick Google search would also have revealed that this limit is imposed by the Google Play Store itself. It’s interesting to see that a number of apps actually hit that limit exactly.

The Highest Number of Reviews:

Which apps have the highest number of reviews? Are there any paid apps among the top 50?

If you look at the number of reviews, you can find the most popular apps on the Android App Store. These include the usual suspects: Facebook, WhatsApp, Instagram etc. What’s also notable is that the list of the top 50 most reviewed apps does not include a single paid app!

Data Visualisation with Plotly: Create Pie and Donut Charts

All Android apps have a content rating like “Everyone” or “Teen” or “Mature 17+”. Let’s take a look at the distribution of the content ratings in our dataset and see how to visualise it with plotly - a popular data visualisation library that you can use alongside or instead of Matplotlib.

Count the number of occurrences:

The first step in creating charts with plotly is to import plotly.express. This is the fastest way to create a beautiful graphic with a minimal amount of code in plotly.

To create a pie chart we simply call px.pie() and then .show() the resulting figure. Plotly refers to all their figures, be they line charts, bar charts, or pie charts, as graph_objects.

Creating a graph object (Pie):

Let’s customise our pie chart. Looking at the .pie() documentation we see a number of parameters that we can set, like title or names. https://plotly.com/python-api-reference/generated/plotly.express.pie.html

plotly.express.pie(data_frame=None, names=None, values=None, color=None, color_discrete_sequence=None, color_discrete_map=None, hover_name=None, hover_data=None, custom_data=None, labels=None, title=None, template=None, width=None, height=None, opacity=None, hole=None)

If you’d like to configure other aspects of the chart, that you can’t see in the list of parameters, you can call a method called .update_traces(). In plotly lingo, “traces” refer to graphical marks on a figure. Think of “traces” as collections of attributes. Here we update the traces to change how the text is displayed.

Donut Chart:

To create a donut 🍩 chart, we can simply add a value for the hole argument:

Numeric Type Conversions for the Installations & Price Data

How many apps had over 1 billion (that's right - BILLION) installations? How many apps just had a single install?

Check the datatype of the Installs column:

To check the data types you can either use .describe() on the column or .info() on the DataFrame.

The "Installs" datatype is Name: Installs, dtype: object

Here we can see: 5 Installs 8199 non-null object

Both of these show that we are dealing with a non-numeric data type. In this case, the type is "object".

Convert the number of installations (the Installs column) to a numeric data type:

If we take two of the columns, say Installs and the App name, we can count the number of entries per level of installations with .groupby() and .count(). However, because we are dealing with a non-numeric data type, the ordering is not helpful. The reason our installs are not recognised as numbers is the comma (,) characters.

Remove the comma (,) from Installs values:

We can remove the comma (,) character - or any character for that matter - from a DataFrame using the string’s .replace() method. Here we’re saying: “replace the , with an empty string”. This completely removes all the commas in the Installs column.

Convert data to numeric type:

We can then convert our data to a number using .to_numeric().
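The two steps can be sketched like this on toy data (the values are made up). The .astype(str) call is a defensive assumption in case some values are not strings already.

```python
import pandas as pd

# Toy data: installs stored as text with thousands separators.
df_apps_clean = pd.DataFrame({
    "App": ["App A", "App B", "App C"],
    "Installs": ["1,000", "1,000,000", "50"],
})

# Step 1: replace every comma with an empty string.
installs = df_apps_clean["Installs"].astype(str).str.replace(",", "")

# Step 2: convert the cleaned strings to numbers.
df_apps_clean["Installs"] = pd.to_numeric(installs)
print(df_apps_clean["Installs"].dtype)
```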

Convert the price column to numeric data:

We can see that the data type of Price column is: 7 Price 8199 non-null object

Delete $ from Price:

We can delete $ from our price and convert it to numeric data type:

Convert Price to number:
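The Price column can be handled the same way as Installs. In this sketch (toy values) regex=False is passed so that "$" is treated as a literal character rather than as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Toy data: prices stored as text, some with a dollar sign.
df_apps_clean = pd.DataFrame({
    "App": ["Free App", "Cheap App", "Pricey App"],
    "Price": ["0", "$2.99", "$399.99"],
})

# Strip the literal "$" and convert the result to a float column.
price = df_apps_clean["Price"].astype(str).str.replace("$", "", regex=False)
df_apps_clean["Price"] = pd.to_numeric(price)
print(df_apps_clean["Price"].tolist())
```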

Remove all apps that cost more than $250 from the df_apps_clean DataFrame:

Here we can see that the top 5 apps are all the same app!

Add a column called 'Revenue_Estimate' to the DataFrame:

This column should hold the price of the app times the number of installs. What are the top 10 highest-grossing paid apps according to this estimate? Out of the top 10, how many are games?

What’s going on here? There are apparently 15 I am Rich apps in the Google Play Store. They all cost $300 or more, which is the main point of the app. The story goes that in 2008, Armin Heinrich released the very first I am Rich app in the iOS App Store for $999.99. The app does absolutely nothing. It just displays the picture of a gemstone and can be used to prove to your friends how rich you are. Armin actually made a total of 8 sales before the app was hastily removed by Apple. Nonetheless, it inspired a bunch of copycats on the Android App Store, but if you search today, you’ll find all of these apps have disappeared as well. The high installation numbers are likely gamed by making the app available for free at some point to get reviews and appear more legitimate.

Leaving this bad data in our dataset will misrepresent our analysis of the most expensive 'real' apps. Here’s how we can remove these rows:

Delete all rows where the Price is > 250:
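A minimal sketch of the filter on toy data (app names and prices are made up). It keeps every row priced at 250 or less and assumes Price has already been converted to a number.

```python
import pandas as pd

# Toy data: one absurdly priced app among normal ones.
df_apps_clean = pd.DataFrame({
    "App": ["I am Rich", "Medical App", "Cheap Game"],
    "Price": [399.99, 79.99, 0.99],
})

# Boolean mask: True for the rows we want to keep.
df_apps_clean = df_apps_clean[df_apps_clean["Price"] <= 250]

# Most expensive remaining apps first.
top_priced = df_apps_clean.sort_values("Price", ascending=False).head(5)
print(top_priced)
```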

When we look at the top 5 apps now, we see that 4 out of 5 are medical apps.

Find the highest-grossing paid apps:

We can work out the highest grossing paid apps now. All we need to do is multiply the values in the price and the installs column to get the number:
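As a sketch with made-up numbers: multiplying two numeric columns is done element by element, so each row gets its own estimate. This assumes both Installs and Price are numeric already.

```python
import pandas as pd

# Toy data with numeric installs and prices.
df_apps_clean = pd.DataFrame({
    "App": ["Game A", "Tool B", "Free C"],
    "Installs": [1000, 200, 50],
    "Price": [5.0, 2.5, 0.0],
})

# Row-wise product of the two columns.
df_apps_clean["Revenue_Estimate"] = df_apps_clean["Installs"] * df_apps_clean["Price"]

# Ten highest estimates first.
top_grossing = df_apps_clean.sort_values("Revenue_Estimate", ascending=False).head(10)
print(top_grossing)
```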

The top spot of the highest-grossing paid app goes to … Minecraft at close to $70 million. It’s quite interesting that Minecraft (along with Bloons and Card Wars) is actually listed in the Family category rather than in the Game category. If we include these titles, we see that 7 out of the top 10 highest-grossing apps are games. The Google Play Store seems to be quite flexible with its category labels.

Plotly Bar Charts & Scatter Plots: The Most Competitive & Popular App Categories

If you were to release an app, would you choose to go after a competitive category with many other apps? Or would you target a popular category with a high number of downloads? Or perhaps you can target a category which is both popular and one where the downloads are spread out among many different apps. That way, even if it’s more difficult to discover among all the other apps, your app has a better chance of getting installed, right? Let’s analyse this with bar charts and scatter plots and figure out which categories are dominating the market.

We can find the number of different categories like so:

This shows us that there are 33 unique categories.

To calculate the number of apps per category we can use our old friend .value_counts():

Or we can check the first 10 categories:
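These three steps can be sketched on toy data like so (category labels and counts are made up). .value_counts() sorts from most to least frequent, so the first entries are the biggest categories.

```python
import pandas as pd

# Toy data: six apps across three categories.
df_apps_clean = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E", "F"],
    "Category": ["FAMILY", "GAME", "FAMILY", "TOOLS", "GAME", "FAMILY"],
})

# Number of different categories.
n_categories = df_apps_clean["Category"].nunique()

# Number of apps per category, largest first.
apps_per_category = df_apps_clean["Category"].value_counts()

# Only the ten largest categories.
top10_category = apps_per_category.head(10)
print(n_categories)
print(top10_category)
```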

To visualise this data in a bar chart we can use the plotly express (our px) bar() function:

Based on the number of apps, the Family and Game categories are the most competitive. Releasing yet another app into these categories will make it hard to get noticed.

But what if we look at it from a different perspective? What matters is not just the total number of apps in the category but how often apps are downloaded in that category. This will give us an idea of how popular a category is. First, we have to group all our apps by category and sum the number of installations:

Then we can create a horizontal bar chart, simply by adding the orientation parameter:

We can also add a custom title and axis labels like so:

Now we see that Games and Tools are actually the most popular categories. If we plot the popularity of a category next to the number of apps in that category we can get an idea of how concentrated a category is. Do few apps have most of the downloads or are the downloads spread out over many apps?

Let’s use plotly to create a scatter plot:

Let's create a DataFrame that has the number of apps in one column and the number of installs in another:

First, we need to work out the number of apps in each category (similar to what we did previously).

Then we can use .merge() and combine the two DataFrames:

Now we can create the chart. Note that we can pass in an entire DataFrame and specify which columns should be used for the x and y by column name.

Extracting Nested Column Data using .stack():

Let’s turn our attention to the Genres column. This is quite similar to the categories column but more granular.

How many different types of genres are there? Can an app belong to more than one genre? Check what happens when you use .value_counts() on a column with nested values. See if you can work around this problem by using the .split() function and the DataFrame's .stack() method.

Working with Nested Column Data:

If we look at the number of unique values in the Genres column we get 114. But this is not accurate if we have nested data like we do here. We can see this using .value_counts() and looking at the values that just have a single entry. There we see that the semi-colon (;) separates the genre names.

We somehow need to separate the genre names to get a clear picture. This is where the string’s .split() method comes in handy. After we’ve separated our genre names based on the semi-colon, we can add them all into a single column with .stack() and then use .value_counts().
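The idea can be sketched on toy data as follows (the genre strings are made up). expand=True makes .split() return one column per piece, which .stack() then folds into a single Series; .value_counts() ignores any missing pieces.

```python
import pandas as pd

# Toy data: some apps list two genres separated by a semi-colon.
df_apps_clean = pd.DataFrame({
    "Genres": ["Action", "Action;Adventure", "Adventure", "Puzzle;Action"],
})

# Naive count treats "Action;Adventure" as its own genre.
naive_count = df_apps_clean["Genres"].nunique()

# Split on ";" into separate columns, then stack them into one column.
stack = df_apps_clean["Genres"].str.split(";", expand=True).stack()

# Count how often each individual genre appears.
num_genres = stack.value_counts()
print(naive_count, len(num_genres))
```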

This shows us we actually have 53 different genres.

Let's create a chart with the Series containing the genre data.

Try experimenting with the built-in colour scales in Plotly. You can find a full list here: https://plotly.com/python/builtin-colorscales/

Find a way to set the colour scale using the color_continuous_scale parameter.

Find a way to make the colour axis disappear by using coloraxis_showscale.

Grouped Bar Charts and Box Plots with Plotly:

Now that we’ve looked at the total number of apps per category and the total number of apps per genre, let’s see what the split is between free and paid apps.

We see that the majority of apps are free on the Google Play Store. But perhaps some categories have more paid apps than others. Let’s investigate. We can group our data first by Category and then by Type. Then we can add up the number of apps per each type. Using as_index=False we push all the data into columns rather than end up with our Categories as the index.
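A minimal sketch of the two-level grouping on toy data (values made up, the result name df_free_vs_paid is an assumption). With as_index=False, Category and Type come back as ordinary columns.

```python
import pandas as pd

# Toy data: free and paid apps in two categories.
df_apps_clean = pd.DataFrame({
    "App": ["A", "B", "C", "D", "E"],
    "Category": ["GAME", "GAME", "GAME", "MEDICAL", "MEDICAL"],
    "Type": ["Free", "Free", "Paid", "Paid", "Free"],
})

# Count the apps for every (Category, Type) combination.
df_free_vs_paid = df_apps_clean.groupby(
    ["Category", "Type"], as_index=False
).agg({"App": "count"})
print(df_free_vs_paid)
```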

How many paid Apps in each category?

Unsurprisingly the biggest categories have the most paid apps. However, there might be some patterns if we put the numbers on a graph!

We can use the plotly express bar chart examples https://plotly.com/python/bar-charts/#bar-chart-with-sorted-or-ordered-categories

and the .bar() API reference https://plotly.com/python-api-reference/generated/plotly.express.bar.html#plotly.express.bar to create a bar chart:

The key is using the color and barmode parameters for the .bar() method. To get a particular order, you can pass a dictionary such as {'categoryorder': 'total descending'} to the xaxis parameter in .update_layout().

What we see is that while there are very few paid apps on the Google Play Store, some categories have relatively more paid apps than others, including Personalization, Medical and Weather. So, depending on the category you are targeting, it might make sense to release a paid-for app.

But this leads to many more questions:

  1. How much should you charge? What are other apps charging in that category?

  2. How much revenue could you make?

  3. And how many downloads are you potentially giving up because your app is paid?

Let’s try and answer these questions with some Box plots:

Box plots show us some handy descriptive statistics in a graph - things like the median value, the maximum value, the minimum value, and some quartiles. Here’s what we’re after:

Let's create a box plot that shows the number of Installs for free versus paid apps. How does the median number of installations compare? Is the difference large or small?

Use the Box Plots Guide https://plotly.com/python/box-plots/ and the .box API reference https://plotly.com/python-api-reference/generated/plotly.express.box.html to create the chart above.

From the hover text in the chart, we see that the median number of downloads for free apps is 500,000, while the median number of downloads for paid apps is around 5,000! This is massively lower.

But does this mean we should give up on selling a paid app? Let’s see how much revenue we would estimate per category.

If an Android app costs $30,000 to develop, then the average app in very few categories would cover that development cost. The median paid photography app earned about $20,000. Many more apps’ revenues were even lower - meaning they would need other sources of revenue like advertising or in-app purchases to make up for their development costs. However, certain app categories seem to contain a large number of outliers that have much higher (estimated) revenue - for example in Medical, Personalisation, Tools, Game, and Family.

So, if you were to list a paid app, how should you price it? To help you decide we can look at how your competitors in the same category price their apps.

What is the median price for a paid app? Let's compare pricing by category by creating another box plot. But this time examine the prices (instead of the revenue estimates) of the paid apps. We can use {'categoryorder': 'max descending'} to sort the categories.

The median price for a paid Android app is $2.99.

However, some categories have higher median prices than others. This time we see that Medical apps have the most expensive apps as well as a median price of $5.49. In contrast, Personalisation apps are quite cheap on average at $1.49. Other categories with higher median prices are Business ($4.99) and Dating ($6.99). It seems like customers who shop in these categories are not so concerned about paying a bit extra for their apps.